Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models
Authors
Abstract
In interactive multi-objective reinforcement learning (MORL), an agent has to simultaneously learn about the environment and the preferences of the user, in order to quickly zoom in on those decisions that are likely to be preferred by the user. In this paper we study interactive MORL in the context of multi-armed bandits. Contrary to earlier approaches that force the utility of the user to be expressed as a weighted sum of the values for each objective, we do not make such stringent a priori assumptions. Specifically, we not only allow non-linear preferences, but also obviate the need to specify the exact model class in which the utility function must fall. To achieve this, we propose a new approach called Gaussian-process Utility Thompson Sampling (GUTS). GUTS employs parameterless Bayesian learning to allow for any type of utility function, exploits monotonicity information, and limits the number of queries posed to the user by ensuring that questions are statistically significant. We show empirically that the regret can be made highly sub-linear in the number of arm pulls. (A preliminary version of this work was presented at the ALA workshop in 2018 [].)
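To make the abstract's description concrete, here is a minimal sketch of the GUTS idea in Python: Thompson sampling over arms, with the unknown utility over reward vectors modelled by a Gaussian process. This is not the authors' implementation; the toy arms, the hidden utility u(v) = v[0]*v[1], and the direct utility queries are all illustrative assumptions (the paper elicits pairwise comparisons and adds monotonicity and statistical-significance machinery that is omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, length_scale=0.5):
    """Squared-exponential kernel between the row vectors of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior_sample(X, y, X_star, noise=1e-2):
    """Draw one sample from the GP posterior over utilities at X_star."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y
    cov = K_ss - K_s.T @ K_inv @ K_s + 1e-9 * np.eye(len(X_star))
    return rng.multivariate_normal(mean, cov)

# Toy two-objective bandit; true mean reward vectors are unknown to the agent.
true_means = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.9]])
hidden_utility = lambda v: v[0] * v[1]       # hidden, non-linear utility
counts, sums = np.zeros(3), np.zeros((3, 2))
util_X, util_y = [np.zeros(2)], [0.0]        # observed (vector, utility) pairs

for t in range(200):
    # Thompson-sample each arm's mean reward vector from a Gaussian posterior.
    post_mean = sums / np.maximum(counts, 1.0)[:, None]
    sampled = post_mean + rng.normal(0.0, 1.0, (3, 2)) / np.sqrt(counts + 1.0)[:, None]
    # Thompson-sample a utility function from the GP and play the arm whose
    # sampled mean vector has the highest sampled utility.
    u = gp_posterior_sample(np.asarray(util_X), np.asarray(util_y), sampled)
    arm = int(np.argmax(u))
    reward = true_means[arm] + rng.normal(0.0, 0.05, 2)
    counts[arm] += 1
    sums[arm] += reward
    if t % 25 == 0:                          # occasionally query the user
        util_X.append(reward)
        util_y.append(hidden_utility(reward))
```

Sampling both the arm means and the utility function means a single argmax drives exploration of the environment and of the preference model at the same time, which is the core of the Thompson-sampling view taken in the paper.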
Similar resources
Interactive Thompson Sampling for Multi-objective Multi-armed Bandits
In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL on the other hand, the agent will often be able to elicit preferences from the user, enabling it to learn about the utility function of its user directly. In this paper, we study online MO...
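To make the elicitation step described above concrete, the short sketch below shows a single pairwise query in which a simulated user indicates which of two reward vectors they prefer. The utility function and the vectors are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def query_user(v, w, utility):
    """Stand-in for a real user: True if reward vector v is preferred to w."""
    return utility(v) >= utility(w)

# Hidden, non-linear utility the agent is trying to learn about.
hidden_utility = lambda v: 0.7 * v[0] + 0.3 * v[1] ** 2

v, w = np.array([0.8, 0.2]), np.array([0.4, 0.9])
preferred, other = (v, w) if query_user(v, w, hidden_utility) else (w, v)
print(f"user prefers {preferred} over {other}")
```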
Multi-Objective X-Armed Bandits
Many of the standard optimization algorithms focus on optimizing a single, scalar feedback signal. However, real-life optimization problems often require a simultaneous optimization of more than one objective. In this paper, we propose a multi-objective extension to the standard X -armed bandit problem. As the feedback signal is now vector-valued, the goal of the agent is to sample actions in t...
Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem
The multi-objective multi-armed bandit (MOMAB) problem is a sequential decision process with stochastic rewards. Each arm generates a vector of rewards instead of a single scalar reward. Moreover, these multiple rewards might be conflicting. The MOMAB-problem has a set of Pareto optimal arms and an agent’s goal is not only to find that set but also to play evenly or fairly the arms in that set....
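The Pareto set mentioned here has a simple operational definition that a short sketch can make concrete: arm i dominates arm j if i's mean is at least as high in every objective and strictly higher in at least one, and the Pareto set is the arms no other arm dominates. The mean vectors below are illustrative, not from the paper.

```python
import numpy as np

def dominates(v, w):
    """Pareto dominance: v >= w everywhere and v > w somewhere."""
    return np.all(v >= w) and np.any(v > w)

def pareto_front(means):
    """Indices of arms not dominated by any other arm."""
    return [i for i, v in enumerate(means)
            if not any(dominates(w, v)
                       for j, w in enumerate(means) if j != i)]

means = np.array([[0.9, 0.1], [0.6, 0.6], [0.2, 0.8], [0.5, 0.5]])
print(pareto_front(means))   # [0, 1, 2]; arm 3 is dominated by arm 1
```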
Active Learning in Multi-armed Bandits
We consider the problem of actively learning the mean values of distributions associated with a finite number of options (arms). The decision maker can select which option to generate the next sample from, the goal being to produce estimates with equally good precision for all the options. If sample means are used to estimate the unknown values then the optimal solution, assuming full knowledge...
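The "equally good precision" criterion pins down the optimal allocation under full knowledge: since the variance of a sample mean is sigma_k^2 / n_k, equalising precision requires pulling each arm in proportion to its reward variance. A minimal numeric illustration follows; the standard deviations and budget are my own assumptions.

```python
import numpy as np

sigmas = np.array([0.5, 1.0, 2.0])   # assumed per-arm reward std. deviations
budget = 700                         # assumed total number of pulls

# Var(mean_k) = sigma_k^2 / n_k, so equal precision across all arms
# requires n_k proportional to sigma_k^2.
n = budget * sigmas**2 / np.sum(sigmas**2)
print(np.round(n, 1))                # [ 33.3 133.3 533.3]
print(sigmas**2 / n)                 # identical variance for every estimate
```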
Exploiting Similarity Information in Reinforcement Learning - Similarity Models for Multi-Armed Bandits and MDPs
This paper considers reinforcement learning problems with additional similarity information. We start with the simple setting of multi-armed bandits in which the learner knows for each arm its color, where it is assumed that arms of the same color have close mean rewards. An algorithm is presented that shows that this color information can be used to improve the dependency of online regret boun...
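As a rough illustration of why such colour information helps (a simplification of mine, not the paper's algorithm), pooling reward samples across arms of the same colour yields a shared mean estimate built from far more samples than any single arm provides:

```python
import numpy as np

rng = np.random.default_rng(1)
colors = np.array([0, 0, 1, 1, 1])                     # arm -> colour
true_means = np.array([0.50, 0.52, 0.80, 0.78, 0.81])  # same colour => close

# Five pulls per arm: too few for sharp per-arm estimates on their own.
samples = {a: rng.normal(true_means[a], 0.2, 5) for a in range(5)}

for c in (0, 1):
    arms = [a for a in range(5) if colors[a] == c]
    pooled = np.concatenate([samples[a] for a in arms])
    print(f"colour {c}: pooled mean over {len(pooled)} samples "
          f"= {pooled.mean():.3f}")
```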
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-67664-3_28